perm filename PATAIM[4,KMC] blob sn#110830 filedate 1974-07-05 generic text, type T, neo UTF8

00100	
00200	
00300	      PATTERN-MATCHING RULES FOR THE RECOGNITION OF
00400	         NATURAL LANGUAGE DIALOGUE EXPRESSIONS
00500	
00600			Kenneth Mark Colby
00700	                      and
00800		        Roger C. Parkison
00900	
01000	
01100		            INTRODUCTION
01200	
01300		To recognize something is to identify it as  an  instance  of
01400	the "same again".   This familiarity is possible because of recurrent
01500	characteristics of  the  world  which  repeat  themselves.  We  shall
01600	describe  an  algorithm which recognizes recurrent characteristics of
01700	natural language dialogue  expressions.  It  utilizes  a  multi-stage
01800	sequence  of pattern-matching rules for progressively transforming an
01900	input expression until  it  eventually  matches  an  abstract  stored
02000	pattern.   The stored pattern has a pointer to a response function in
02100	memory which decides what to do once the input has  been  recognized.
02200	Here  we  discuss  only  the  recognizing  functions,  except for one
02300	response function (anaphoric substitution) which  interactively  aids
02400	the  recognition  process.    Details  of  how the response functions
02500	operate will be described in a future communication by William Faught
02600	and ourselves.
02700		We are constructing and  testing  a  simulation  of  paranoid
02800	thought  processes;  our  problem is to reproduce paranoid linguistic
02900	behavior  in  a  teletyped  diagnostic  psychiatric  interview.   The
03000	diagnosis   of  paranoid  states,  reactions  or  modes  is  made  by
03100	clinicians who judge the degree of correspondence between  what  they
03200	observe  in  an  interview  and  their  conceptual  model of paranoid
03300	behavior.   There  exists  a   high   degree   of   agreement   among
03400	psychiatrists about this conceptual model which relies mainly on what
03500	an interviewee says and how he says it.
03600		Natural  language  is a life-expressing code which people use
03700	for  communication  with  themselves  and  others.   In  a  real-life
03800	dialogue  such  as  a  psychiatric  interview,  the participants have
03900	interests, intentions, and expectations which are revealed  in  their
04000	linguistic  expressions.     An  interactive simulation of a paranoid
04100	patient must be  able  to  demonstrate  typical  paranoid  linguistic
04200	behavior.   To  achieve this effect, our paranoid model must have the
04300	ability to deal with the teletyped messages of an interviewer.
04400		A number of approaches have been taken for dealing with
04500	natural language dialogue expressions (Winograd, 1972; Woods, 1970).
04600	These  approaches  rely on parsers which conduct a detailed syntactic
04700	and semantic analysis. They perform well for the purposes  for  which
04800	they  were  designed. Their weakness, for our purposes, lies in their
04900	lack of neglecting and  ignoring  mechanisms.   Such  mechanisms  are
05000	necessary  in  a  program  which accepts and responds to unrestricted
05100	conversational English characterized  by  expressions  novel  to  the
05200	program.
05300		How humans process natural language is largely unknown.  They
05400	possess  some  knowledge of grammatical rules, but this fact does not
05500	entail  that  they  use  a  grammar  in  interpreting  and  producing
05600	language.  It  seems  implausible  to  us  that  people  possess full
05700	transformational grammars for processing  language.      Language  is
05800	what  is  recognized but the processes involved may not be linguistic
05900	or grammatical.      Originally transformational  grammars  were  not
06000	designed  to "understand" a large subset of English; they constituted
06100	a formal method for deciding whether a string is grammatical.
06200		An analysis of what one's problem actually  is  should  guide
06300	the  selection  or  invention of methods appropriate to its solution.
06400	Our problem is not to develop a  consistent  and  general  theory  of
06500	language  nor  to  assert  empirically  testable hypotheses about how
06600	people process language.    Our problem is  to  design  an  algorithm
06700	which  recognizes  what is being said in a dialogue and what is being
06800	said about it in order to make a response such that a sample  of  I-O
06900	pairs  from  the  paranoid model is judged similar to a sample of I-O
07000	pairs  from  paranoid  patients.      The  design  task  belongs   to
07100	artificial  intelligence in which the criterion is how adequately the
07200	computer program performs mind-like functions.  New methods had to be
07300	devised  for  an  algorithm  to  participate in a human dialogue in a
07400	paranoid-patient-like way.   We sought effective methods which  could
07500	operate  efficiently  in  real  time.     Since our method provides a
07600	general way of many-to-one mapping  from  surface  expressions  to  a
07700	single  stored  pattern,  it  is  not  limited  to  the simulation of
07800	paranoia, but can be used by any type of "host"  system  which  takes
07900	natural language as input.
08000		Our method is to transform  the  input  until  a  pattern  is
08100	obtained which matches completely or partially a more abstract stored
08200	pattern.  This strategy  has  proved  adequate  for  our  purposes  a
08300	satisfactory  percentage  of the time.   The power of this method for
08400	natural  language  dialogues  lies  in  its  ability  to  ignore   as
08500	irrelevant  some  of  what  it  recognizes and everything it does not
08600	recognize  at  all.    A  linguistic   parser   doing   word-by-word,
08700	parts-of-speech analysis fails when it cannot find one or more of the
08800	input words in its dictionary. A system that must know every word  is
08900	too fragile for unrestricted dialogues.
09000		In early versions of the paranoid model, such as PARRY1, some
09100	of  the  pattern  recognition  mechanisms allowed the elements of the
09200	pattern to be order independent (Colby, Weber, and Hilf, 1971).   For
09300	example, consider the following expressions:
09400		(1) WHERE DO YOU WORK?
09500		(2) WHAT SORT OF WORK DO YOU DO? 
09600		(3) WHAT IS YOUR OCCUPATION? 
09700		(4) WHAT DO YOU DO FOR A LIVING? 
09800		(5) WHERE ARE YOU EMPLOYED? 
09900	In  PARRY1  a  procedure  scans  these  expressions  looking  for  an
10000	information-bearing  contentive  such as "work", "for a living", etc.
10100	When it finds such a contentive along with "you"  or  "your"  in  the
10200	expression,  regardless  of word order, it responds to the expression
10300	as if it were a question about the nature of one's work. This  method
10400	correctly  classifies  the  five  sentences  above. Unfortunately, it
10500	includes the two examples below in the same category:
10600		(6) DOES YOUR FATHER'S CAR WORK?
10700		(7) HOW DID THINGS WORK OUT FOR YOU?
10800	An  insensitivity  to word order has the advantage that lexical items
10900	representing  different  parts  of  speech  can  represent  the  same
11000	concept, e.g. the word "work" represents the same concept whether it is
11100	used as a noun or a verb. But a price is paid for this resilience and
11200	elasticity.   We  find  from  experience  that,  since English relies
11300	heavily on word order to convey the  meaning  of  its  messages,  the
11400	average   penalty  of  misunderstanding  (to  be  distinguished  from
11500	ununderstanding) is too great.  Hence  in  PARRY2,  as  will  be
11600	described shortly, all the patterns require a specified word order.
11700		For  high-complexity  problems  it   is   helpful   to   have
11800	constraints.        Diagnostic psychiatric interviews (and especially
11900	those conducted over teletypes)  have  several  natural  constraints.
12000	First,  clinicians  are  trained  to ask certain questions in certain
12100	ways.    This limits the number of  patterns  required  to  recognize
12200	utterances  about  each  topic.  Second,  only a few hundred standard
12300	topics are brought up by interviewers who are,  furthermore,  trained
12400	to  use everyday expressions and especially those used by the patient
12500	himself.  When the interview is conducted by  teletypes,  expressions
12600	tend  to  be  shortened  since  the interviewer tries to increase the
12700	information transmission rate over the slow channel  of  a  teletype.
12800	Finally,   teletyped  interviews  represent  written  utterances  and
12900	utterances are known to be highly redundant  such  that  unrecognized
13000	words  can be ignored without losing the meaning of the message. Also
13100	utterances are loaded with idioms, cliches, pat phrases, etc. -
13200	all  being  easy  prey  for  a  pattern-matching  approach.    It  is
13300	time-wasting and  usually  futile  to  try  to  decode  an  idiom  by
13400	analyzing the meanings of its individual words.
13500		We  now  describe  the  pattern-matching  functions  of   the
13600	algorithm  in  some  detail. (See Fig. 1 for a diagram of the overall
13700	flow of control).
13800	
13900	
14000			    OVERVIEW
14100	
14200		PARRY2 has  two  primary  modules.   The  first  attempts  to
14300	RECOGNIZE the input and the second RESPONDS.  This paper is primarily
14400	about the  RECOGNIZE  module.   It  functions  independently  of  the
14500	RESPOND  module  except  in the case of pronoun references, which the
14600	RESPOND module provides to the RECOGNIZER on request.
14700		The recognition module has 4 main steps:
14800		1) Identify the words in the question and convert them to
14900			internal synonyms.
15000		2) Break the input into segments at certain bracketing words.
15100		3) Match each segment (independently) to a stored pattern.
15200		4) Match the resulting list of recognized segments to a stored
15300			  compound pattern.
15400	Each  of  these  steps,  except  the  segmenting, throws away what it
15500	cannot identify.  Occasionally a reference to  an  unknown  topic  is
15600	mis-recognized as some familiar topic.
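The four steps above can be rendered as a toy pipeline. The sketch below is a minimal Python illustration, not the MLISP original; the dictionary, bracketing words, and patterns are invented stand-ins for PARRY2's actual tables.

```python
# Toy rendering of the four RECOGNIZE steps. The dictionary, bracketing
# words, and patterns below are illustrative stand-ins, not PARRY2's data.
DICTIONARY = {"WHAT": "WHAT", "IS": "BE", "ARE": "BE", "YOUR": "YOU",
              "YOU": "YOU", "OCCUPATION": "JOB", "WHY": "WHY",
              "IN": "IN", "HOSPITAL": "HOSPITAL"}
BRACKET_WORDS = {"WHY", "WHAT", "IN", "TO"}
SIMPLE_PATTERNS = {("WHAT", "BE", "YOU", "JOB"): "topic-job",
                   ("WHY", "BE", "YOU"): "why-question",
                   ("IN", "HOSPITAL"): "topic-hospital"}
COMPOUND_PATTERNS = {(("WHY", "BE", "YOU"), ("IN", "HOSPITAL")):
                     "why-in-hospital"}

def recognize(text):
    # Step 1: convert known words to internal synonyms; unknown words
    # (e.g. "THE") simply drop out.
    words = [DICTIONARY[w] for w in text.upper().replace("?", "").split()
             if w in DICTIONARY]
    # Step 2: break the input into segments at bracketing words.
    segments, current = [], []
    for w in words:
        if w in BRACKET_WORDS and current:
            segments.append(tuple(current))
            current = []
        current.append(w)
    if current:
        segments.append(tuple(current))
    # Step 3: match each segment independently to a simple pattern.
    matched = [SIMPLE_PATTERNS[s] for s in segments if s in SIMPLE_PATTERNS]
    # Step 4: try the recognized segments against a compound pattern.
    key = tuple(s for s in segments if s in SIMPLE_PATTERNS)
    return COMPOUND_PATTERNS.get(key, matched)
```

Note how everything the dictionary does not know is discarded before matching begins, which is exactly the "throwing away" behavior described above.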
15700	
15800	 		    PREPROCESSING
15900	
16000		Each word in the input expression is first  looked  up  in  a
16100	dictionary  of (currently) about  1900 entries which, for the sake of
16200	speed, is maintained in core during run-time.  Illustrative  examples
16600	of  dictionary entries are given in Appendix 2. The dictionary, which
16700	was built empirically from thousands  of  teletyped  interviews  with
16800	previous  versions  of the model, consists of words, groups of words,
16900	and names of word-classes they can be translated into.  The  size  of
17000	the  dictionary  is  determined  by  the patterns, i.e. a word is not
17100	included unless it appears in one or more patterns.  Entries  in  the
17200	dictionary  reflect  PARRY2's main interests.  If a word in the input
17400	is not in the dictionary, it is checked to see if it ends with one of
17500	the common suffixes given in Appendix 2.5.  If it does, the suffix is
17600	removed and the remaining word is looked up again.  If it still
17800	is not in the dictionary,  it  is  dropped  from  the  pattern  being
17900	formed. Thus if the input is:
18000		WHAT IS YOUR CURRENT OCCUPATION?
18100	and  the  word  "current"  is   not in the dictionary, the pattern at
18200	this stage becomes:
18300		( WHAT IS YOUR OCCUPATION )   
18400	The question-mark is thrown away as  redundant  since  questions  are
18500	recognized  by  word  order. (A statement followed by a question mark
18600	(YOU GAMBLE?) is responded to in  the  same  way  as  that  statement
18700	followed  by  a  period.) Synonymic translations of words are made so
18800	that the pattern becomes, for example:
18900		( WHAT  BE  YOU  JOB )   
19000		Some groups of words (i.e. idioms) are translated as a  group
19100	so that, for example, "for a living" becomes "for job". Certain other
19200	juxtaposed words are contracted into a single word,  e.g.  "place  of
19300	birth"  becomes  "birthplace".  This  is  done to deal with groups of
19400	words which are  represented  as  a  single  element  in  the  stored
19500	pattern,  thereby preventing segmentation from occurring at the wrong
19600	places, such as at a preposition inside an idiom or  phrase.  Besides
19700	these  contractions, certain expansions are made so that for example,
19800	"DON'T" becomes "DO NOT" and "I'D" becomes "I WOULD".
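The translations of this and the preceding paragraph (contraction expansion, idiom grouping, synonym substitution) can be sketched as follows. The tables hold only the examples from the text; they are illustrative, not PARRY2's dictionary.

```python
# Sketch of preprocessing translations. Tables are illustrative only.
IDIOMS = {("FOR", "A", "LIVING"): ["FOR", "JOB"],
          ("PLACE", "OF", "BIRTH"): ["BIRTHPLACE"]}
EXPANSIONS = {"DON'T": ["DO", "NOT"], "I'D": ["I", "WOULD"]}
SYNONYMS = {"IS": "BE", "YOUR": "YOU", "OCCUPATION": "JOB"}

def preprocess(words):
    # Expand contractions first ("DON'T" -> "DO NOT").
    expanded = []
    for w in words:
        expanded.extend(EXPANSIONS.get(w, [w]))
    # Replace word groups (idioms) as units, then translate single words.
    out, i = [], 0
    while i < len(expanded):
        for phrase, repl in IDIOMS.items():
            if tuple(expanded[i:i + len(phrase)]) == phrase:
                out.extend(repl)        # idiom handled as one element
                i += len(phrase)
                break
        else:
            out.append(SYNONYMS.get(expanded[i], expanded[i]))
            i += 1
    return out
```

Grouping the idiom before segmentation is what keeps the segmenter from later splitting at a preposition buried inside it.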
19900		Misspellings  can  be the bane of teletyped interviews for an
20000	algorithm.  Here  they  are  handled  in  two  ways.  First,   common
20100	misspellings  of  important  words  are simply put in the dictionary.
20200	Thus "yuu" is known to mean "you". The apostrophe  is  often  omitted
20300	from contractions so most contractions are recognized with or without
20400	it. These common misspellings were gathered from over 4000 interviews
20500	with earlier versions of the paranoid model.
20600		Second,  five  common  forms  of  typing  error  are  checked
20700	systematically.  These are:
20800		1) Doubled letter
20900		2) Extraneous letter
21000		3) Forgetting to hold the "shift key" for an apostrophe
21100		4) Hitting a nearby key on the keyboard
21200		5) Transposing two letters in a word
21300	The first three errors can be corrected  by  deleting  the  offending
21400	character  from  the  word.     This is accomplished by deleting each
21500	character in turn until the word is recognized.  The fourth  type  of
21600	error is only checked for eight of the more common near misses. These
21700	were also empirically determined and involve the letter pairs (T  Y),
21800	(Q  W), (Y U), (I O), (G H), (O P), (A S), and (N M).   These methods
21900	are all based on typing errors, but they also correct some legitimate
22000	English  spelling  errors.     Two-letter transposition corrects, for
22100	example, "beleive" to "believe".
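The five checks amount to generating candidate respellings and accepting one only if it is in the dictionary. The sketch below uses a stand-in word list; the eight near-miss pairs are those given above. (The apostrophe was a shifted "7" on teletypes of the period, so error 3 shows up as a stray digit, e.g. "DON7T", which single-character deletion removes.)

```python
# Candidate-generation sketch of the five typing-error corrections.
# KNOWN is a stand-in for the real dictionary.
KNOWN = {"YOU", "BELIEVE", "DONT", "WHAT"}
NEAR_MISSES = [("T", "Y"), ("Q", "W"), ("Y", "U"), ("I", "O"),
               ("G", "H"), ("O", "P"), ("A", "S"), ("N", "M")]

def correct(word):
    if word in KNOWN:
        return word
    # Errors 1-3 (doubled letter, extraneous letter, unshifted apostrophe):
    # delete each character in turn until the word is recognized.
    for i in range(len(word)):
        candidate = word[:i] + word[i + 1:]
        if candidate in KNOWN:
            return candidate
    # Error 4: a nearby key, checked only for the eight common pairs.
    for i, c in enumerate(word):
        for a, b in NEAR_MISSES:
            swap = b if c == a else a if c == b else None
            if swap:
                candidate = word[:i] + swap + word[i + 1:]
                if candidate in KNOWN:
                    return candidate
    # Error 5: two adjacent letters transposed.
    for i in range(len(word) - 1):
        candidate = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if candidate in KNOWN:
            return candidate
    return None  # still unrecognized; the word will be dropped
```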
22200	
22300	                      SEGMENTING
22400	
22500		Another  weakness  in the crude pattern matching of PARRY1 is
22600	that it takes the entire input expression  as  its  basic  processing
22700	unit.   If  only two words are recognized in an eight word utterance,
22800	the risk of misunderstanding is great. We need a way of dealing  with
22900	units shorter than the entire input expression.
23000		Aided by a heuristic from work in machine-translation (Wilks,
23100	1973), we devised a way of bracketing the pattern constructed up to
23200	this  point  into  shorter  segments  using  prepositions,  wh-forms,
23300	certain  verbs, etc. as bracketing points.  (A list of the bracketing
23400	terms  appears  in  Fig.  2).    These  points   tend   to   separate
23500	prepositional  phrases  and  embedded  clauses  from the main clause.
23600	The  new  pattern  formed  is  termed  either  "simple",  having   no
23700	delimiters  within  it,  or "compound", i.e., being made up of two or
23800	more simple patterns. A simple pattern might be:
23900		( WHAT BE YOU JOB )
24000	whereas a compound pattern would be:
24100		(( WHY BE YOU ) ( IN HOSPITAL ))
24200	Our experience with this method of segmentation shows  that  compound
24300	patterns  from teletyped psychiatric dialogues rarely consist of more
24400	than three or four segments. 
24500		After certain verbs (See  Fig.  4)  a  bracketing  occurs  to
24600	replace the commonly omitted "THAT", such that:
24700		( I THINK YOU BE AFRAID )
24800	becomes
24900		(( I THINK ) ( YOU BE AFRAID ))
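The segmenter can be sketched as a single scan that splits before delimiter words and after "THAT"-dropping verbs. The full delimiter lists of Figs. 2 and 4 are not reproduced here; the few entries below are illustrative guesses.

```python
# Sketch of segmentation into simple patterns. Delimiter lists are
# illustrative stand-ins for Figs. 2 and 4.
BRACKETS = {"WHY", "WHAT", "IN", "TO", "FOR"}     # split before these
THAT_VERBS = {"THINK", "BELIEVE", "FEEL"}         # split after these

def segment(pattern):
    segments, current = [], []
    for word in pattern:
        if word in BRACKETS and current:
            segments.append(current)
            current = []
        current.append(word)
        if word in THAT_VERBS:       # bracket where "THAT" was omitted
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments
```

A pattern with no internal delimiter comes back as a single segment, i.e. a "simple" pattern; otherwise the result is the list of segments forming a "compound" pattern.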
25000	
25100			   MATCHING INDIVIDUAL SEGMENTS
25200	
25300		Conjunctions serve only as markers for the segmenter and they
25400	are dropped out after segmentation.
25500		Negations are  handled  by  extracting  the  "NOT"  from  the
25600	segment  and  assigning  a value to a global variable which indicates
25700	that the expression is negative in form. When a  pattern  is  finally
25800	matched,  this variable is consulted. Some patterns have a pointer to
25900	a pattern  of  opposite  meaning  if  a  "NOT"  could  reverse  their
26000	meanings.      If this pointer is present and a "NOT" was found, then
26100	the pattern matched is replaced by its opposite, e.g.  ( I not  trust
26200	you  ) is replaced by the pattern ( I mistrust you ). We have not yet
26300	observed the troublesome  case  of  "he  gave  me  not  one  but  two
26400	messages". (There is no need to scratch where it doesn't itch).
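The NOT-extraction mechanism can be sketched directly: pull the "NOT" out, set a flag, and if the matched pattern names an opposite, substitute it. The pattern set and opposite table below are illustrative.

```python
# Sketch of negation handling. PATTERNS and OPPOSITES are illustrative.
PATTERNS = {("I", "TRUST", "YOU"), ("I", "MISTRUST", "YOU")}
OPPOSITES = {("I", "TRUST", "YOU"): ("I", "MISTRUST", "YOU")}

def match_with_negation(segment):
    negative = "NOT" in segment                   # note the negation...
    stripped = tuple(w for w in segment if w != "NOT")   # ...and remove it
    if stripped in PATTERNS:
        # If a NOT was found and the pattern has an opposite, use it.
        if negative and stripped in OPPOSITES:
            return OPPOSITES[stripped]
        return stripped
    return None
```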
26500		Substitutions  are also made in certain cases.  Some segments
26600	contain pronouns which could stand for a number of  different  things
26700	of  importance  to PARRY2.       As we mentioned in the introduction,
26800	the response functions of memory keep track of the context  in  order
26900	to  give  pronouns and other anaphoras a correct interpretation.  For
27000	example, the segment:
27100		( DO YOU AVOID THEM )
27200	could refer to the Mafia, or racetracks, or other patients, depending
27300	on the context.  When such a segment is encountered,  the  pronoun  is
27400	replaced by its current anaphoric value as determined by the response
27500	functions, and a more specific segment such as:
27600		( DO YOU AVOID MAFIA )
27700	is  looked  up.
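The substitution itself is simple once the referent is known; in this sketch the interaction with the RESPOND module is reduced to a dictionary of current anaphoric values, which is a stand-in for that interface.

```python
# Sketch of anaphoric substitution; current_referents stands in for the
# context kept by the RESPOND module.
def substitute_anaphora(segment, current_referents):
    return tuple(current_referents.get(w, w) for w in segment)
```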
27800		Other  utterances,  such  as  "Why  did you do that?" or just
27900	"Why?" (which might be regarded as a massive ellipsis), clearly refer
28000	back  to  previous  utterances.   These utterances match very general
28100	patterns which identify the type of question without  indicating  the
28200	exact  topic. The response function which responds to "Why?" consults
28300	the context to produce an appropriate answer.
28400		The algorithm next attempts to match the segments with stored
28500	simple patterns which  currently  number  about  1700.  (Examples  of
28900	simple  patterns  appear in Appendix 3). First a complete and perfect
29000	match is sought.   When a match is found, the stored pattern name has
29100	a  pointer to the name of a response function in memory which decides
29200	what to do further.  If a match is not found, further transformations
29300	of the segment are carried out and a "fuzzy" match is tried.
29400		For fuzzy matching at this stage, we  adopted  the  heuristic
29500	rule of dropping elements in the segment one at a time and attempting
29600	a match each time.   This heuristic allows ignoring familiar words in
29700	unfamiliar  contexts.    For example, "well" is important in "Are you
29800	well?" but meaningless in "Well are you?".
29900	 	Deleting one element at a time results  in,  for example, the
30000	pattern:
30100			( WHAT BE YOU MAIN PROBLEM )
30200	becoming successively:
30300			(a) ( BE YOU MAIN PROBLEM )
30400			(b) ( WHAT YOU MAIN PROBLEM )
30500			(c) ( WHAT BE MAIN PROBLEM )
30600			(d) ( WHAT BE YOU PROBLEM )
30700			(e) ( WHAT BE YOU MAIN )
30800	Since the stored pattern in this case matches (d), (e) would  not  be
30900	constructed.   We  found  it  unwise  to delete more than one element
31000	since our segmentation method usually yields  segments  containing  a
31100	small number (1-4) of words.
31200		Dropping  an  element  at  a  time  provides  a   probability
31300	threshold  for  fuzzy   matching which is a function of the length of
31400	the segment. If a segment consists of five elements, four of the five
31500	must be present in a particular order (with the fifth element missing
31600	in any position) for a match to occur. If  a  segment  contains  four
31700	elements, three must match - and so forth.
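The one-element-deletion heuristic just described is compact enough to sketch whole: try the exact segment first, then each variant with a single element dropped, in order. The stored pattern here is illustrative.

```python
# Sketch of the fuzzy match: exact lookup, then single-element deletions.
STORED = {("WHAT", "BE", "YOU", "PROBLEM"): "topic-problem"}

def fuzzy_match(segment):
    if segment in STORED:
        return STORED[segment]          # complete and perfect match
    for i in range(len(segment)):       # drop element i and retry
        variant = segment[:i] + segment[i + 1:]
        if variant in STORED:
            return STORED[variant]
    return None                         # no match at this stage
```

With the stored pattern above, ( WHAT BE YOU MAIN PROBLEM ) matches on the fourth deletion, exactly as in case (d) of the example.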
31800	
31900			   COMPOUND-PATTERN MATCH
32000	
32100		When more than one simple pattern is detected in the input, a
32200	second  matching  is  attempted  against about 500 compound patterns.
32300	Certain patterns, such as ( HELLO ) and (  I  THINK  ),  are  dropped
32400	because  they  are considered meaningless. If a complete match is not
32500	found, then simple patterns are dropped, one  at  a  time,  from  the
32600	complex pattern. This allows the input,
32700		(( HOW DO YOU COME ) ( TO BE ) ( IN HOSPITAL ))
32800	to  match  the  stored  pattern,
32900		(( HOW  DO  YOU COME ) ( IN HOSPITAL )).
33000	
33100		If  no  match  can  be found at this point, the algorithm has
33200	arrived at a default condition and the appropriate response functions
33300	decide  what  to  do.  For example, in a default condition, the model
33400	may assume  control  of  the  interview,  asking  the  interviewer  a
33500	question, continuing with the topic under discussion or introducing a
33600	new topic.
33700		An annotated example of a diagnostic psychiatric interview is
33800	presented in Appendix 1.
33900	
34000	
34100			   ADVANTAGES AND LIMITATIONS
34200	
34300		As   mentioned,   one   of   the   main   advantages   of   a
34400	pattern-matching strategy is that it can ignore  as  irrelevant  both
34500	some  of  what  it  recognizes and what it does not recognize at all.
34600	There are several million words in English, each possessing from  one
34700	to  over  a  hundred  senses.     To  construct a machine-usable word
34800	dictionary of this magnitude is out of the  question  at  this  time.
34900	Recognition  of  natural language input in the manner described above
35000	allows real-time interaction in a dialogue since it  avoids  becoming
35100	ensnarled   in  combinatorial  disambiguations  and  long  chains  of
35200	inferencing  which  would  slow  a   dialogue   algorithm   down   to
35300	impracticality,  if it could even function at all. The price paid for
35400	pattern-matching is that  sometimes,  but  rarely,  ambiguities  slip
35500	through.
35600		Another  advantage of this method is its speed. The algorithm
35700	consists of about 28K of programs written in MLISP, 16K  of  data  in
35800	LISP,  and 16K of data in machine language with several overlays. The
35900	complete language recognition process requires less than  one  second
36000	of real time on a time-shared DEC PDP-10.
36100		A drawback to PARRY1 is that it reacts to the  first  pattern
36200	it  finds  in the input rather than characterizing the input as fully
36300	as possible and then deciding what to do based on a number of  tests.
36400	Another   practical   difficulty  with  PARRY1  from  a  programmer's
36500	viewpoint, is that, since it is a procedural model, elements  of  the
36600	patterns   are  strung  out  in  various  procedures  throughout  the
36700	algorithm.    It is often a considerable chore for the programmer  to
36800	determine  whether  a given pattern is present and precisely where it
36900	is.   In PARRY2 the patterns are all collected in  one  part  of  the
37000	data-base where they can easily be examined.
37100		Concentrating all the patterns in the data base gives  PARRY2
37200	a  limited  "learning"  ability.   When  an  input fails to match any
37300	stored pattern or matches an incorrect one,  as  judged  by  a  human
37400	operator,  a  pattern  which  matches  the  input can be put into the
37500	data-base automatically.  If the new pattern has the same meaning  as
37600	a previously stored pattern, the human operator must provide the name
37700	of the appropriate response function.  If  he  doesn't  remember  the
37800	name,  he  may  try  to  rephrase the input in a form recognizable to
37900	PARRY2 and it will name the response  function  associated  with  the
38000	rephrasing.  These mechanisms are not "learning" in the commonly used
38100	sense but they do allow a  person  to  transfer  his  knowledge  into
38200	PARRY2's data-base with very little  effort.
38300		Informal  observation  thus  far  shows  PARRY2's  linguistic
38400	recognition  abilities  to  be  quite  superior  to  PARRY1's. A more
38500	systematic and quantitative evaluation of performance  is  now  being
38600	carried  out.   PARRY1  was  extensively tested by having judges make
38700	ratings of its performance along several dimensions, one of which was
38800	linguistic noncomprehension (Colby and Hilf, 1974). These judges also
38900	made ratings of teletyped interviews with  psychiatric  patients  and
39000	with a random version of PARRY1. The mean ratings of PARRY1 along the
39100	dimension of  linguistic  noncomprehension  were  better  than  those
39200	received  by  RANDOM-PARRY  but  were three times worse than the mean
39300	ratings received by patients. Once the ratings of PARRY2  along  this
39400	dimension  are  completed, we will be able to compare them with those
39500	of PARRY1 and the patients and obtain a  more  objective  measure  of
39600	improvement.
39700	
39800	
39900	                  REFERENCES
40000	
40100	Colby, K.M., Weber, S., and Hilf, F.D. (1971). Artificial Paranoia.
40200		ARTIFICIAL INTELLIGENCE, 2, 1-25.
40300	
40400	Colby, K.M. and Hilf, F.D. (1974). Multidimensional Evaluation
40500	       of a Computer Simulation of Paranoid Thought. To appear
40600	       in KNOWLEDGE AND COGNITION, L. Gregg (Ed.).
40700	
40800	Wilks, Y. (1973). An Artificial Intelligence Approach to Machine
40900		Translation. In COMPUTER MODELS OF THOUGHT AND 
41000		LANGUAGE, R.C.Schank and K.M. Colby, Eds., W.H. Freeman,
41100		San Francisco.
41200	
41300	Winograd, T.A. (1972). A Program for Understanding Natural
41400	        Language, COGNITIVE PSYCHOLOGY, 3, 1-191.
41500	
41600	Woods, W.A. (1970). Transition Network Grammars for Natural Language
41700	        Analysis. COMMUNICATIONS OF THE ACM, 13, 591-606.